Chemical Database
A chemical database is a database specifically designed to store chemical information. Most chemical databases store information on stable molecules.
Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper (2D structural formulae). While these are ideal visual representations for the chemist, they are unsuitable for computational use, especially for search and storage. Small molecules (also called ligands in drug design applications) are usually represented using lists of atoms and their connections. Large molecules such as proteins are, however, more compactly represented using the sequences of their amino acid building blocks.
Large chemical databases are expected to handle the storage and searching of information on millions of molecules taking terabytes of physical memory.
Representation
There are two principal techniques for representing chemical structures in digital databases: connection tables, which list the atoms and the bonds between them, and linear string notations such as SMILES. These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
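As a rough illustration of the list-of-atoms-and-connections representation described earlier, the sketch below stores ethanol as a tiny connection table and derives an adjacency list from it. The variable names and the heavy-atoms-only layout are assumptions made for this example, not a standard file format; the string "CCO" is the same molecule in SMILES notation.

```python
from collections import defaultdict

# Hypothetical connection table for ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]            # atom index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]     # (atom_a, atom_b, bond_order)

def adjacency(bonds):
    """Build {atom: [(neighbour, bond_order), ...]} from a bond list."""
    adj = defaultdict(list)
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return dict(adj)

adj = adjacency(bonds)

# The same molecule written as a linear string (SMILES).
smiles = "CCO"
```

The connection table makes neighbours easy to look up during a search, while the string form is compact to store and index.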
Search
Substructure
Chemists can search databases using parts of structures, parts of their IUPAC names, or constraints on properties. Chemical databases differ from general-purpose databases chiefly in their support for substructure search. This kind of search is achieved by looking for a subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The algorithms for searching are computationally intensive, often of O(n^3) or O(n^4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom searching (ABAS), in which a mapping of the search substructure's atoms and bonds onto the target molecule is sought. ABAS usually makes use of Ullmann's algorithm or variations of it.
Speedups are achieved by time amortization; that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves creating bitstrings that represent the presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to eliminate the need for ABAS comparison with target molecules that do not possess those fragments. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bitstrings used for these applications are also called structural keys. The performance of such keys depends on the choice of the fragments used for constructing them and the probability of their presence in the database molecules. Another kind of key makes use of hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store structural keys and fingerprints can be reduced by 'folding', which combines parts of the key using bitwise operations and thereby reduces its overall length.
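The screening, folding and atom-by-atom steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the four-entry fragment dictionary is invented for the example, molecules are reduced to element labels plus neighbour sets, and the matcher is plain backtracking rather than Ullmann's refined algorithm.

```python
FRAGMENT_DICTIONARY = ["C-C", "C-O", "C=O", "C-N"]  # hypothetical fragment set

def make_key(fragments):
    """Structural key: bit i is set when dictionary fragment i is present."""
    key = 0
    for bit, fragment in enumerate(FRAGMENT_DICTIONARY):
        if fragment in fragments:
            key |= 1 << bit
    return key

def fold(key, length):
    """Halve a key of even `length` by OR-ing its upper half onto its lower half."""
    half = length // 2
    return (key & ((1 << half) - 1)) | (key >> half)

def passes_screen(query_key, target_key):
    """A target can contain the query only if it has every bit the query has."""
    return (query_key & target_key) == query_key

def abas_match(q_atoms, q_adj, t_atoms, t_adj):
    """Backtracking atom-by-atom search; returns {query_atom: target_atom} or None."""
    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = len(mapping)  # query atoms are mapped in index order
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-mapped query neighbour must land on a target neighbour.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]
        return None
    return extend({})

# Precomputed keys for two hypothetical database entries and one query.
ethanol_key = make_key({"C-C", "C-O"})
acetone_key = make_key({"C-C", "C=O"})
query_key = make_key({"C-O"})

# Query: a C-O fragment. Target: ethanol heavy atoms C-C-O.
q_atoms, q_adj = ["C", "O"], {0: {1}, 1: {0}}
t_atoms, t_adj = ["C", "C", "O"], {0: {1}, 1: {0, 2}, 2: {1}}

mapping = None
if passes_screen(query_key, ethanol_key):  # cheap bit test first
    mapping = abas_match(q_atoms, q_adj, t_atoms, t_adj)  # expensive step second
```

Only targets that survive the bit test reach the backtracking matcher. Folding keeps the screen safe (a bit set in the query stays set in any containing target) but makes it less selective, because distinct fragments start to share bits.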
Conformation
Search by matching the 3D conformation of molecules or by specifying spatial constraints is another feature that is of particular use in drug design. Searches of this kind can be computationally very expensive. However, newer algorithms such as the Ultrafast Shape Recognition algorithm devised by Pedro Ballester and Graham Richards of Oxford University help speed up search by encoding molecular shape into a small set of parameters such as the first three moments (mean, variance and skewness) of the distributions of interatomic distances.
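The moment-based encoding can be sketched as below. This is a simplified illustration, not the published Ultrafast Shape Recognition procedure: it assumes a single reference point (the centroid) for the distance distribution, and the similarity score is one plausible choice among several. All function names are hypothetical.

```python
import math

def moments(values):
    """Return (mean, variance, skewness) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in values) / n / var ** 1.5
    return (mean, var, skew)

def shape_descriptor(coords):
    """Moments of the distances of all atoms from the molecular centroid."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    distances = [math.dist(c, centroid) for c in coords]
    return moments(distances)

def shape_similarity(d1, d2):
    """Score in (0, 1]; 1 means identical descriptors."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(d1, d2)) / len(d1)
    return 1.0 / (1.0 + mean_abs_diff)

# Hypothetical coordinates: a bent three-atom shape, a translated copy, a linear shape.
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
bent_shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in bent]
linear = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

same = shape_similarity(shape_descriptor(bent), shape_descriptor(bent_shifted))
different = shape_similarity(shape_descriptor(bent), shape_descriptor(linear))
```

Because the descriptor depends only on distances, it is unchanged by translation and rotation, so two molecules can be compared without aligning them first.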
Descriptors
All properties of molecules beyond their structure can be divided into physico-chemical and pharmacological attributes, also called descriptors. On top of that, there exist various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, (partial) charge, solubility, etc. can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly, using involved multivariate statistics or experimental (screening, bioassay) results. To avoid repeating this computational effort, all of these descriptors can be stored along with the molecule's representation, and usually are.
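As a small example of a descriptor computed directly from structure, the sketch below derives a molecular weight from an element count and stores it next to the structure. The record layout and the short table of standard atomic masses are assumptions for illustration.

```python
# Standard atomic masses (rounded) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(composition):
    """Sum atomic masses over an {element: count} composition."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())

ethanol = {"C": 2, "H": 6, "O": 1}
mw = molecular_weight(ethanol)

# Hypothetical database record: the descriptor is stored with the structure
# so that it does not have to be recomputed for every query.
record = {"name": "ethanol", "smiles": "CCO", "mol_weight": round(mw, 3)}
```

Storing the computed value trades a little space for not having to parse the structure again whenever a property filter is applied.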
Similarity
There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. Two molecules might be considered more similar, for instance, if the difference in their molecular weights is lower than when compared with others. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean measures and non-Euclidean measures depending on whether the triangle inequality holds.
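One way to read "inverse of a distance in descriptor space" is sketched below. The mapping 1 / (1 + d) and the two-component descriptor vectors (molecular weight, formal charge) are assumptions chosen for the example; in practice descriptors would normally be scaled first so that no single one dominates.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Hypothetical descriptor vectors: (molecular weight, formal charge).
mol_a = (46.07, 0.0)
mol_b = (60.10, 0.0)
mol_c = (180.16, 0.0)

sim_ab = similarity(mol_a, mol_b)
sim_ac = similarity(mol_a, mol_c)
```

Here mol_a comes out closer to mol_b than to mol_c simply because their molecular weights differ less, which mirrors the example in the text.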
Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm, which is based on k-nearest neighbours.
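A compact sketch of Jarvis-Patrick style clustering follows. It assumes the common formulation in which two items join the same cluster when each appears in the other's k-nearest-neighbour list and the two lists share at least k_min entries; the parameter names and the one-dimensional test data are illustrative only.

```python
def jarvis_patrick(points, k, k_min, dist):
    """Cluster point indices using shared k-nearest neighbours."""
    n = len(points)
    # Step 1: the k nearest neighbours of every point.
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        neighbours.append(set(others[:k]))

    # Step 2: merge pairs that are mutual neighbours and share enough neighbours.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbours[j] and j in neighbours[i]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    # Step 3: collect the connected groups.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical one-descriptor data set (for example, molecular weights).
weights = [10.0, 11.0, 12.0, 50.0, 51.0, 52.0]
clusters = jarvis_patrick(weights, k=2, k_min=1, dist=lambda a, b: abs(a - b))
```

The method is non-hierarchical and needs no preset number of clusters; k and k_min control how readily groups merge.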
In pharmacologically-oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox) that can in turn be semiautomatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
Registration systems
Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases.
Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems, such as the CAS system, make use of algorithms to generate unique hash codes to achieve the same objective.
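The idea of enforcing uniqueness through a canonical representation can be illustrated with a toy registry. The canonical form here is a brute-force stand-in (the smallest relabelled connection table over all atom orderings), usable only for very small molecules; it is not how canonical SMILES or the CAS system work, and all names are hypothetical.

```python
from itertools import permutations

def canonical_key(atoms, bonds):
    """Smallest (labels, bonds) tuple over all atom renumberings (factorial cost)."""
    best = None
    n = len(atoms)
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted((min(perm[a], perm[b]), max(perm[a], perm[b]), order)
                       for a, b, order in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best

registry = {}

def register(atoms, bonds):
    """Return the existing record id for a known structure, else assign a new one."""
    key = canonical_key(atoms, bonds)
    return registry.setdefault(key, len(registry) + 1)

# Ethanol entered with two different atom numberings, then dimethyl ether.
id_a = register(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
id_b = register(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
id_c = register(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
```

Both ethanol entries resolve to the same record despite different atom numbering, while the isomer dimethyl ether gets its own record.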
A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system.
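A minimal sketch of that distinction: the registry key below pairs the structure with an explicit stereo status, so the same drawn structure yields separate records. The enum values follow the three cases named in the text; the class and function names, and the use of a stereo-free SMILES string as the structure key, are assumptions for illustration.

```python
from enum import Enum

class StereoStatus(Enum):
    UNKNOWN = "unknown"
    KNOWN_MIXTURE = "known mixture"
    RACEMIC = "racemic"

stereo_registry = {}

def register_with_stereo(structure, status):
    """Register a (structure, stereo status) pair; the status is mandatory."""
    if not isinstance(status, StereoStatus):
        raise ValueError("the registrar must state the stereo status explicitly")
    key = (structure, status)
    return stereo_registry.setdefault(key, len(stereo_registry) + 1)

lactic_acid = "CC(O)C(=O)O"  # SMILES without stereo annotation
rec_unknown = register_with_stereo(lactic_acid, StereoStatus.UNKNOWN)
rec_racemic = register_with_stereo(lactic_acid, StereoStatus.RACEMIC)
rec_again = register_with_stereo(lactic_acid, StereoStatus.RACEMIC)
```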
Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.
An example is the Chemical Abstracts Service (CAS) registration system. See also CAS registry number.
Tools
The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.
There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle- and PostgreSQL-based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a benzene ring in their structure, represented as a SMILES string in a SMILESCOL column, could be:
SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')
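The general mechanism, a user-defined function that SQL queries can call, can be imitated with Python's built-in sqlite3 module. This is only an analogy to cartridge technology: the function name SMILES_CONTAINS is invented here, and its body is a plain substring test standing in for a real substructure search, which would need a chemistry toolkit.

```python
import sqlite3

def smiles_contains(smiles, fragment):
    """Placeholder only: substring test, NOT a genuine substructure match."""
    return 1 if fragment in smiles else 0

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it.
conn.create_function("SMILES_CONTAINS", 2, smiles_contains)

conn.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
conn.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("benzene", "c1ccccc1"), ("toluene", "Cc1ccccc1"), ("ethanol", "CCO")],
)

rows = conn.execute(
    "SELECT NAME FROM CHEMTABLE "
    "WHERE SMILES_CONTAINS(SMILESCOL, 'c1ccccc1') ORDER BY NAME"
).fetchall()
names = [name for (name,) in rows]
```

A real cartridge would parse both strings into molecular graphs before matching, since the same ring can be written as many different SMILES strings that a substring test would miss.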
Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC. Work is under way to establish a unique IUPAC standard (see InChI).